Abstract

We use neural networks to perform retrievals 
of temperature and water fractions from sim- 
ulated clear air radiances for the Atmospheric 
Infrared Sounder (AIRS). Neural networks
allow us to make effective use of the large AIRS
channel set, and give good performance with 
noisy input. We retrieve surface temperature, 
air temperature at 64 distinct pressure levels, 
and water fractions at 50 distinct pressure lev- 
els. Using 728 temperature and surface sensi- 
tive channels, the RMS error for temperature 
retrievals with 0.2K input noise is 1.2K. Us- 
ing 586 water and temperature sensitive chan- 
nels, the mean error with 0.2K input noise is 
16%. Our implementation of backpropagation 
training for neural networks on the 16,000- 
processor MasPar MP-1 runs at a rate of 90 
million weight updates per second, and al- 
lows us to train large networks in a reasonable 
amount of time. Once trained, the network 
can be used to perform retrievals quickly on a 
workstation of moderate power. 


1 Introduction 

The next generation of NASA earth viewing 
satellites on Earth Observing System (EOS) 
platforms will produce a deluge of raw data 
that must be processed into products that 
describe the state of the earth and its at- 
mosphere over time. Satellite instruments 
that probe the atmosphere measure radiances 
over a number of channels, and this informa- 
tion must be “inverted” to obtain information 
about the atmospheric state, such as the tem- 
perature, humidity, and composition. 

The Atmospheric Infrared Sounder (AIRS) 
[3], currently under development, should pro- 
vide both higher accuracy and vertical reso- 
lution than the present operational sounders 
(HIRS/MSU) [10], and lead to higher fore- 
casting skill and a long term accurate mea- 
sure of climate change. The AIRS instru- 
ment will contain upwards of 4000 channels 
at a much higher spectral resolution than the 
currently operational HIRS instrument, which 
has 20 channels. The optimum use of these
data for atmospheric sounding in a cost ef- 
fective way may require completely new tech- 
niques, as existing methods for current instru- 
ments may not be transferable in a straight- 
forward manner. Traditional retrieval (or in- 
version) techniques are computationally inten- 
sive, especially non-linear techniques that re- 
quire several iterative calculations of the chan- 
nel radiances. It is estimated that the AIRS 
will require one of the most computationally 
intensive data systems on EOS. 

To address these new computational chal- 
lenges, we have implemented a backpropaga- 
tion training algorithm on the Maspar MP-1 
at Goddard Space Flight Center to train neu- 
ral networks to perform atmospheric retrievals 
of temperature and water profiles from simu- 
lated clear air radiances for the AIRS instru- 
ment. [The problem of cloudy atmospheres is 
a topic of future work not treated here.] These 
neural networks allow us to make effective use 
of the large AIRS channel set, give good per- 
formance with noisy input data, and allow for 
very fast processing even with very large num- 
bers of channels. 

We have found that the backpropagation 
code maps very well to the Maspar, and we 
have obtained network training rates of 93 mil- 
lion connection updates per second (CUPS) in 
single precision [1]. Once such a network has 
been trained on the Maspar, it can be down- 
loaded to a workstation where the time to ob- 
tain retrievals is the time to perform three ma- 
trix multiplies - of order less than 0.5 sec with 
a thousand input channels. (On the Maspar 
the retrieval time is at least an order of mag- 
nitude faster). 

The accuracy of the results obtained with
our neural networks is quite competitive
with other retrieval methods. Using 728
temperature and surface sensitive channels, and
with 0.2K std noise added to the input bright- 
ness temperatures, the neural network has an 
overall RMS error retrieving 64 pressure levels 
of 1.22K. Using 586 water, surface, and tem- 
perature sensitive channels, and with 0.2K std 
noise added, the neural network has an overall 
error retrieving 50 pressure levels of 16% [2]. 

In order to better understand retrieval per- 
formance, we perform a sensitivity analysis of 
trained networks. This analysis is useful in se- 
lecting what sets of channels are to be used, in 
a process of iterative refinement, and in many 
cases shows a close correspondence to plots of 
weighting functions (discussed in the next sec- 
tion). 

In the sequel we describe the atmospheric 
retrieval problem, show how we use neural 
networks to solve the problem, describe the 
datasets used in training the networks, and 
present a number of representative results. We 
also describe the method of sensitivity analy- 
sis for evaluating the effectiveness of input sets 
to a neural network. 


2 Atmospheric Retrievals 

The problem of atmospheric retrievals [7], [5] 
(the “inverse problem”) is to take as input 
the radiances at a specified set of frequency 
channels measured by a sensor on a satellite 
above the top of the atmosphere and compute 
the temperature or water profiles of the atmo- 
sphere (as a function of pressure) that gave 
rise to those radiances. 

Associated with the inverse problem is the 
“forward problem” of computing the radiances 
at the top of the atmosphere generated by
layers of molecules in local thermal equilib- 
rium from the surface up through the atmo- 
spheric column in the sensor’s field of view. 
(We refer to this column as a temperature
profile.) Assuming a plane parallel atmosphere
in local thermodynamic equilibrium, negligible
scattering, and no instrument function,
one can write the monochromatic radiance at
nadir at the top of the atmosphere as


$$R_\nu = \epsilon_\nu B_\nu(T_s)\,\tau_\nu(P_s, [T(P')]) + \int_{\ln P_s}^{\ln \bar{P}} B_\nu[T(P)]\,\frac{d\tau_\nu(P, [T(P')])}{d\ln P}\,d\ln P$$


where $\epsilon_\nu$ is the emissivity of the surface $s$, and
the contribution of reflected radiation, which is
negligible at most frequencies of interest, has
not been included. $B_\nu(T)$ is the Planck function
for emitted radiance of a blackbody at
frequency $\nu$ and temperature $T$,


$$B_\nu(T) = \frac{1.19 \times 10^{-5}\,\nu^3}{\exp[1.439\,\nu/T] - 1}$$


The quantity $\tau_\nu(P_s, [T(P')])$ is the atmospheric
transmittance from the surface at
pressure $P_s$ to the top of the atmosphere at
pressure $\bar{P}$, which is the fraction of photons of
frequency $\nu$ emitted at the surface $P_s$ that
arrive at the sensor at altitude $\bar{P}$. The quantity
$d\tau_\nu(P, [T(P')])/d\ln P$ is the weighting function for the
frequency $\nu$ and when multiplied by $d\ln P$
describes the fraction of photons of frequency $\nu$
emitted in the layer between pressure $P$ and
$P + dP$ that reach the top of the atmosphere.
Fig. 1 [3] shows a few of the several thou- 
sand weighting functions available from the 
AIRS instrument and indicates how a weighting
function can be associated with a narrow
vertical region of the atmosphere. The notation
$(P, [T(P')])$ as the argument of $\tau_\nu$ is
used to stress that it is a functional of the profile
$T(P')$ between $P$ and $\bar{P}$ and a function of
$P$.
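As a rough illustration of the forward problem, the radiance integral can be discretized over pressure layers. The sketch below is not the fast transmittance code used in this work: the function names, the layer temperatures, and the transmittance values are hypothetical, and the Planck constants are the ones quoted above, with the wavenumber assumed to be in cm^-1.

```python
import math

C1 = 1.19e-5   # first Planck constant as quoted in the text
C2 = 1.439     # second Planck constant as quoted in the text

def planck(nu, temp):
    """Blackbody radiance at wavenumber nu (cm^-1 assumed) and temperature temp (K)."""
    return C1 * nu ** 3 / (math.exp(C2 * nu / temp) - 1.0)

def forward_radiance(nu, t_surface, emissivity, layer_temps, tau_levels):
    """Discretized clear-air radiance at the top of the atmosphere.

    tau_levels[k] is the transmittance from level k to the top of the
    atmosphere, ordered from the surface (k = 0) upward, so the last
    entry is 1.0. layer_temps[k] is the temperature of the layer
    between levels k and k + 1.
    """
    radiance = emissivity * planck(nu, t_surface) * tau_levels[0]
    for k, temp in enumerate(layer_temps):
        # tau[k+1] - tau[k] approximates (d tau / d ln P) d ln P for this layer
        weight = tau_levels[k + 1] - tau_levels[k]
        radiance += planck(nu, temp) * weight
    return radiance

# Hypothetical 3-layer column, for illustration only.
tau = [0.2, 0.5, 0.8, 1.0]
temps = [280.0, 250.0, 220.0]
r = forward_radiance(700.0, 288.0, 1.0, temps, tau)
```

Because the surface transmittance and the layer weights sum to one, an isothermal column with unit emissivity returns the Planck radiance at that temperature, which is a convenient sanity check on any discretization of this kind.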


Present retrieval systems are most eas- 
ily classified as being either linear regression 
techniques or non-linear iterative techniques. 
Both techniques can use varying amounts of 
statistics for regularizing their solutions, as 
well as varying amounts of the forward prob- 
lem radiative transfer. The linear regression 
approach is dependent on a very good first 
guess in order to be in the linear regime for the 
regression. The non-linear iterative method 
does not require such a good first guess, but 
does require time-consuming forward problem 
calculations. In addition, it is not clear if the 
non-linear iterative approach can coherently 
use all the information in the AIRS channel
radiances without numerical problems. It may
also be possible to iterate the linear regression
approach; however, this would require
iteratively calculating the forward
problem for a very large number of channels,
introducing a very heavy computational
burden.


3 Neural Networks 


We use a three-layer feed-forward neural net- 
work, batch trained with a modified back- 
propagation algorithm [6], [8] with an adap- 
tive learning rate. This network can be repre- 
sented as 

$$\mathbf{Y} = F_3(\mathbf{W}_3 F_2(\mathbf{W}_2 F_1(\mathbf{W}_1 \mathbf{X} + \mathbf{B}_1) + \mathbf{B}_2) + \mathbf{B}_3),$$

where each $F_i$ maps matrices to matrices, element
by element, by applying a transfer function
to each matrix element, and the matrices
shown in boldface type are combined by matrix
multiplication and addition. The mapping
$F_i$ is often referred to as a layer, with
the weight matrices representing connections
between layers. We use the hyperbolic tangent
as a transfer function in the first two
layers, and a linear function in the third.
The input matrix $\mathbf{X}$ is of size (row x col)
$n_{in} \times n_{training}$ and the output matrix is of
size $n_{out} \times n_{training}$. The $\mathbf{W}_i$ are weight
matrices of size respectively $n_1 \times n_{in}$, $n_2 \times n_1$,
and $n_{out} \times n_2$. The $\mathbf{B}_i$ are bias matrices of
respective sizes $n_1 \times n_{training}$, $n_2 \times n_{training}$,
and $n_{out} \times n_{training}$, composed of single bias
column vectors of respectively size $n_1$, $n_2$, and
$n_{out}$ replicated $n_{training}$ times to build the bias
matrices. The quantities $n_{in}$, $n_1$, $n_2$, $n_{out}$, and
$n_{training}$ are the number of input units (frequency
channels), the number of first layer
hidden units, the number of second layer hidden
units, the number of output units (pressure
levels), and the number of examples in
the training set.
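The network equation above can be sketched in a few lines. This is a minimal illustration, assuming matrices stored as lists of rows and a tiny made-up network; the helper names (`matmul`, `add_bias`, `apply`, `forward`) are ours and are not taken from the MasPar code.

```python
import math

def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_bias(m, bias):
    """Add bias[i] to every column of row i (the replicated bias matrix B)."""
    return [[x + bias[i] for x in row] for i, row in enumerate(m)]

def apply(f, m):
    """Apply a transfer function element by element."""
    return [[f(x) for x in row] for row in m]

def forward(x, weights, biases):
    """Y = F3(W3 F2(W2 F1(W1 X + B1) + B2) + B3) with tanh, tanh, linear."""
    w1, w2, w3 = weights
    b1, b2, b3 = biases
    h1 = apply(math.tanh, add_bias(matmul(w1, x), b1))
    h2 = apply(math.tanh, add_bias(matmul(w2, h1), b2))
    return add_bias(matmul(w3, h2), b3)   # linear output layer

# Tiny made-up network: n_in = 3, n_1 = 2, n_2 = 2, n_out = 1, n_training = 2.
x = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.4]]         # n_in x n_training
weights = ([[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]],   # n_1 x n_in
           [[0.5, -0.5], [0.3, 0.1]],             # n_2 x n_1
           [[1.0, -1.0]])                         # n_out x n_2
biases = ([0.0, 0.1], [0.0, 0.0], [0.5])          # one bias column vector per layer
y = forward(x, weights, biases)                   # n_out x n_training
```

Each column of `x` is one training example, so a whole batch passes through the network with three matrix multiplications.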

The networks we use for temperature re- 
trievals have one input component for each 
instrument channel, and one output compo- 
nent for each AIRS pressure level. The first 
layer has between 90 and 108 transfer func- 
tions, the second between 60 and 72 transfer 
functions, and the output layer has a linear 
function for each pressure level. For water re- 
trievals we have used 90 transfer functions in 
the first layer and 60 in the second layer. 

Back-propagation training is a variation of 
gradient descent, in which weight and bias 
vectors are incrementally adjusted in an at- 
tempt to match the network output with a 
set of training examples. This training set is
a set of pairs, where each pair is an input to- 
gether with the desired output. A single pre- 
sentation of all the training data and corre- 
sponding weight and bias adjustment is called 
an epoch. Training consists of a sequence of 
epochs, and typically continues until the sum- 
squared error is acceptable or some resource 
limit is encountered. Training is a computa- 
tionally intensive process for non-trivial net- 
works. Although training is slow, applying a 
trained net is very fast, with the runtime being 
dominated by the time for the three matrix- 
vector multiplies. 
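The batch training loop described above can be sketched as follows. This is a small illustration, not the MasPar implementation: the adaptive learning rate rule shown (grow the rate after an improving epoch, otherwise reject the step and shrink the rate) is an assumption, since the text does not spell out the rule used, and the data set and layer sizes are made up.

```python
import math
import random

def forward(params, x):
    """Return the activations of every layer for one input vector x."""
    acts = [x]
    for layer, (w, b) in enumerate(params):
        z = [sum(wij * aj for wij, aj in zip(row, acts[-1])) + bi
             for row, bi in zip(w, b)]
        last = layer == len(params) - 1          # the output layer is linear
        acts.append(z if last else [math.tanh(v) for v in z])
    return acts

def batch_gradient(params, data):
    """Sum-squared error over the training set and the gradient of half of it."""
    grads = [([[0.0] * len(w[0]) for _ in w], [0.0] * len(b)) for w, b in params]
    sse = 0.0
    for x, target in data:
        acts = forward(params, x)
        delta = [y - t for y, t in zip(acts[-1], target)]
        sse += sum(d * d for d in delta)
        for layer in range(len(params) - 1, -1, -1):
            gw, gb = grads[layer]
            for i, d in enumerate(delta):
                gb[i] += d
                for j, a in enumerate(acts[layer]):
                    gw[i][j] += d * a
            if layer > 0:
                w = params[layer][0]
                # propagate back through the weights and the tanh derivative
                delta = [(1.0 - a * a) * sum(w[i][j] * delta[i] for i in range(len(w)))
                         for j, a in enumerate(acts[layer])]
    return sse, grads

def step(params, grads, rate):
    """Return new parameters after one gradient-descent step."""
    return [([[wij - rate * g for wij, g in zip(row, grow)] for row, grow in zip(w, gw)],
             [bi - rate * g for bi, g in zip(b, gb)])
            for (w, b), (gw, gb) in zip(params, grads)]

def train(params, data, epochs, rate=0.01):
    """Batch training: one epoch is one pass over all data plus one update."""
    sse, grads = batch_gradient(params, data)
    for _ in range(epochs):
        trial = step(params, grads, rate)
        trial_sse, trial_grads = batch_gradient(trial, data)
        if trial_sse < sse:      # accept the step and grow the rate
            params, sse, grads = trial, trial_sse, trial_grads
            rate *= 1.05
        else:                    # reject the step and shrink the rate
            rate *= 0.5
    return params, sse

def init(sizes, rng):
    """Small random weights and zero biases for the given layer sizes."""
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

rng = random.Random(0)
data = [([x / 10.0], [math.sin(x / 10.0)]) for x in range(-10, 11)]  # toy data
params = init([1, 4, 3, 1], rng)
start_sse, _ = batch_gradient(params, data)
params, final_sse = train(params, data, epochs=200)
```

Because a step is kept only when it lowers the sum-squared error, the training error in this sketch never increases from one epoch to the next.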

It is convenient in the case of temperature
retrievals to convert radiances to brightness
temperatures $\Theta_\nu$ according to the relation
$B_\nu(\Theta_\nu) = R_\nu$ [9]. The brightness temperature
is the temperature a blackbody would need to
have to produce the radiance $R_\nu$. This reduces the
large dynamic range of radiances
to a much smaller dynamic range of brightness
temperatures. Further, each element of
the input and output vector pairs is scaled
to be a difference from its mean value over the
training set, divided by its standard
deviation over the training set. This “normalizes”
the inputs and outputs to a useful dynamic
range for the transfer functions used.
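The two preprocessing steps above can be sketched as follows. The inversion follows directly from solving the Planck function for temperature. The function names are ours, and the use of the population standard deviation (dividing by the number of samples) is an assumption, since the text does not say which estimator was used.

```python
import math

C1 = 1.19e-5   # Planck constants as quoted in the text
C2 = 1.439

def planck(nu, temp):
    """Blackbody radiance at wavenumber nu and temperature temp (K)."""
    return C1 * nu ** 3 / (math.exp(C2 * nu / temp) - 1.0)

def brightness_temperature(nu, radiance):
    """Invert the Planck function: temperature of a blackbody emitting `radiance`."""
    return C2 * nu / math.log(1.0 + C1 * nu ** 3 / radiance)

def standardize(values):
    """Subtract the mean of the set and divide by its standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values], mean, std

# Round trip at a made-up wavenumber and temperature.
tb = brightness_temperature(900.0, planck(900.0, 275.0))

# Scale one channel (or one output level) across a made-up training set.
scaled, mean, std = standardize([270.0, 275.0, 280.0, 285.0])
```

The stored `mean` and `std` of each output level are needed again at retrieval time to convert the network output back to physical units.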

We have developed a backpropagation code 
for the 128 x 128 processor MasPar MP-1 at 
the Goddard Space Flight Center in MPL (MasPar’s
parallel extension of C), which makes
extensive use of the Maspar linear algebra li- 
brary. This code efficiently handles the virtu- 
alization needed to map very large networks 
of many tens of thousands of weights and bi- 
ases across the 16384 processing elements of 
the machine. Originally the code was written 
completely in double precision (64 bits) but 
since the results were found to be highly
immune to noise in the data sets, a single
precision version is now being used. Profiling tests
show the code spends 95% of the time per- 
forming matrix multiplications, for which the 
Maspar routines are highly optimized. We are 
observing execution rates of 93 million weight 
updates a second [1] on typical datasets. 
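To give a feel for what this rate means, here is a back-of-the-envelope estimate. It assumes that one "weight update" corresponds to one weight processed for one training example, which the text does not state explicitly, and it uses the run 150 network size from Table 1 with its 880 training profiles. The resulting times are an estimate under that assumption, not measured values.

```python
layer_sizes = [666, 108, 72, 67]   # run 150 network, from Table 1
n_training = 880                   # training profiles used in run 150
rate = 93e6                        # weight updates per second, as quoted

# Number of weights and biases in the three-layer network.
weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
biases = sum(layer_sizes[1:])

# Assumed cost of one epoch: every weight visited once per training example.
updates_per_epoch = weights * n_training
seconds_per_epoch = updates_per_epoch / rate
hours_for_140k_epochs = 140_000 * seconds_per_epoch / 3600.0
```

Under this assumption the network has 84,528 weights and 247 biases, an epoch takes roughly 0.8 seconds, and 140,000 epochs take on the order of 31 hours.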


4 Datasets for Training 


Datasets for training and testing are gener- 
ated from the set of 1761 TIGR profiles [4] 
of temperature and water using the radiative 
transfer equation, to obtain corresponding ra- 
diances for the entire AIRS channel set. Thus 
the physics of the problem is built in by (1) the 
judicious selection of a large representative set 
of profiles and (2) the radiative transfer equa- 
tion that gives the matching radiances. The 
TIGR profiles have been interpolated from the 
original 40 levels to either 66 TOVS pressure 
levels (for earlier experiments) or 64 TOVS 
pressure levels (as used in the AIRS science 
team’s “write test”). The retrieved quantities
are the temperatures and water amounts in 
the 64 intervening slabs with an additional el- 
ement for the surface temperature, which may 
be different from the lowest slab. The surface 
emissivity is assumed to be one for these
experiments.

Our general method is to partition a dataset 
into training and extrapolation sets. The net 
is trained on the training set, and is then 
tested with the extrapolation set, both with 
and without noise; the noise added to the inputs has a
normal distribution and 0.2K standard deviation.
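The partition and noise steps can be sketched as below. The even/odd split mirrors the one used for the runs reported in the Results section, assuming profiles are numbered from 1 in list order. The function names and the use of Python's `random` module are illustrative choices.

```python
import random

def split_even_odd(profiles):
    """Even-numbered profiles for training, odd-numbered for extrapolation.

    Profiles are assumed to be numbered from 1 in list order, so the
    even-numbered ones sit at list indices 1, 3, 5, ...
    """
    return profiles[1::2], profiles[0::2]

def add_noise(brightness_temps, std=0.2, rng=None):
    """Add zero-mean normal noise with standard deviation `std` (in K)."""
    rng = rng or random
    return [t + rng.gauss(0.0, std) for t in brightness_temps]

# 1761 profile numbers stand in for the TIGR profiles.
profile_numbers = list(range(1, 1762))
train, test = split_even_odd(profile_numbers)

# Noisy copy of a made-up set of channel brightness temperatures.
noisy = add_noise([250.0, 260.0, 270.0], rng=random.Random(0))
```

With 1761 profiles this gives 880 training and 881 extrapolation profiles.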


5 Results


In this section we present representative re- 
sults for several profile and channel sets. In 
general, training runs were stopped when the 
RMS training error stopped showing signifi- 
cant improvement; this occurred after on the 
order of 100,000 epochs. Once network pa- 
rameters (adaptive learning parameters, sizes 
of hidden layers, and initial distributions) are 
fixed in a useful range, different sets of random 
initial weights typically have a small effect on 
final RMS error. When the full set of TIGR 
profiles is divided into training and extrapo- 
lation sets of approximately equal size (with 
representatives from all latitudes in both sets) 
exchanging training and extrapolation subsets 
also has a small effect. The results for all the
runs discussed are summarized in Table 1.

In run 150, the 880 even numbered TIGR 
profiles were used for training and the 881 odd 
numbered TIGR profiles were used for test- 
ing the network. Input to the net is bright- 
ness temperature for 666 AIRS channels, se- 
lected for surface and air temperature sen- 
sitivity. Output is surface temperature and 
air temperature at 66 distinct pressure lev- 
els. The network has 108 hyperbolic tangent 
transfer functions in the first hidden layer, and 
72 hyperbolic tangent transfer functions in the 
second hidden layer. After 140,000 epochs, 
RMS training error is 1.20K, RMS extrapo- 
lation (testing) error is 1.26K, and RMS ex- 
trapolation error with 0.2K std noise is 1.44K. 
These results are shown in Fig. 2. After 
100,000 epochs of further training with noisy 
data (0.2K std noise added to the input data), 
RMS training error is 1.22K, RMS extrapola- 
tion error is 1.23K, and RMS extrapolation 
error with 0.2K std noise is 1.37K.


In the upper plot of Fig. 2, the temperature 
retrieval error at the surface and at each of 66 
pressure levels is shown. In the lower plot, the 
same set of errors is presented as 11 groups of 
6 pressure levels (the surface is still distinct, 
and is not grouped with any pressure levels).
We do not have a completely satisfactory ex- 
planation for the small ’oscillations’ in the 66 
level plot. This pattern of fine variations ap- 
pears across a wide range of training sessions 
and channel sets. (Note the similarity between 
these small scale variations in the Fig. 2 and 
Fig. 3 plots.) One possible explanation is that 
these variations correspond to variations in 
the numbers of weighting functions available 
at different pressure levels. Another possibil- 
ity is that these may be an artifact of the fast 
transmittance code (as supplied by JPL for 
the AIRS science team’s “write test”) that we
use to generate brightness temperatures. This 
is a matter for further investigation. 
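For concreteness, per-level and grouped errors of the kind plotted in Fig. 2 could be computed as in the sketch below. How the grouped values were actually pooled is not stated in the text; pooling by RMS over the levels in each group is an assumption, and the function names are ours.

```python
import math

def rms_per_level(retrieved, truth):
    """RMS error at each output level over a set of profiles."""
    n = len(retrieved)
    levels = len(retrieved[0])
    return [math.sqrt(sum((retrieved[p][k] - truth[p][k]) ** 2
                          for p in range(n)) / n)
            for k in range(levels)]

def grouped_rms(level_rms, group_size):
    """Pool per-level RMS errors over groups of adjacent levels."""
    groups = []
    for i in range(0, len(level_rms), group_size):
        chunk = level_rms[i:i + group_size]
        groups.append(math.sqrt(sum(e * e for e in chunk) / len(chunk)))
    return groups

# Two made-up profiles with two levels each.
per_level = rms_per_level([[1.0, 2.0], [3.0, 2.0]], [[0.0, 2.0], [0.0, 2.0]])

# 66 levels pooled into 11 groups of 6, as in the lower plot of Fig. 2.
pooled = grouped_rms([1.0] * 66, 6)
```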

A sensitivity analysis of run 150 (discussed 
in the next section) is shown in Fig. 4. 
This analysis, together with similar results 
from other runs using the same channel set, 
indicated that channels with wavenumbers 
roughly between 750 and 1200 were not be- 
ing used by the network. This information, 
together with the relatively high error above 
the 50mb pressure level suggested changes to 
the channel set, which were incorporated in 
run 170. 

In run 170, the 880 even numbered TIGR
profiles were used for training and the 881 odd
numbered TIGR profiles were used for testing
the network, as before. Input to the net
is brightness temperature for 728 AIRS channels,
selected for surface and air temperature
sensitivity, taking into account previous
sensitivity analysis. Output is surface temperature
and air temperature at 64 distinct pressure
levels (see footnote 1). The network is the same
size as the network for run 150. After 160,000 epochs,
RMS training error is 1.02K, RMS extrapolation
error is 1.09K, and RMS extrapolation error
with 0.2K std noise is 1.22K. These results
are shown in Fig. 3. A slight improvement in
noise performance of this network could probably
be realized by further training with noisy
data.


Figure 2: RMS temperature errors for run 150. (Horizontal axes: 66 AIRS pressure levels, pressure in mb.)


Figure 3: RMS temperature errors for run 170.


Run   Net Size              Epoch     RMS errors (a)
                                      train   test    noise
150   666 x 108 x 72 x 67   240,000   1.22K   1.23K   1.37K
170   728 x 108 x 72 x 65   160,000   1.02K   1.09K   1.22K
90    586 x 90 x 60 x 50    50,000    13.2%   15.0%   15.9%

Table 1: Summary of runs discussed.

A sensitivity analysis of run 170 is shown 
in Fig. 5. Note that the ‘flat spot’ (the large 
group of unused middle channels) is much re- 
duced, but that there are still some unused 
channels. 

Fig. 6 shows some initial results for wa- 
ter retrievals. Input to the net is brightness 
temperatures for 586 AIRS channels, selected 
for both water and temperature sensitivity. 
The same set of TIGR profiles was used as
in runs 150 and 170, while the network was 
slightly smaller, with 90 transfer functions in 
the first hidden layer and 60 in the second. 

1 We switched from 66 to 64 pressure levels to match 
conventions used for the AIRS science team “write 
test.” 


After 50,000 epochs, overall error for the first 
50 pressure levels (expressed as percentages) is 
13.2% training error, 15.0% extrapolation er- 
ror, and 15.9% extrapolation error when 0.2K 
std noise is added. 

As with more traditional methods of inter- 
polation, neural networks can both under- and 
over-fit. High training error or inability to 
converge on the training set is a sign of under- 
fitting, while poor performance on new data 
is a sign of over-fitting. The close correspon- 
dence between training and extrapolation er- 
rors on all the runs, and appropriate smooth- 
ness of retrieved profiles, suggest that the size 
of our hidden layers is not too large, and that 
we are not overfitting. It may be possible to 
use larger hidden layers to improve training 
and also (though to a lesser degree) extrap- 
olative behavior. 


6 Sensitivity Analysis 

Figure 5: Sensitivity plot for run 170.

Figure 6: Percentage errors for H2O for run 90 (TIGR profiles, epoch 5e+04).

Once a network has been trained we can obtain
a measure of its dependency on the input
channel set by computing the Jacobian matrix
of the partial derivatives of outputs with respect
to inputs evaluated at a representative
sample of profiles. In particular we have computed
numerically by differences the quantity

$$S_{ij} = \frac{1}{N} \sum_{\gamma} \left( \frac{\Delta T_i^{\gamma}}{\Delta \Theta_j^{\gamma}} \right)^2$$

where $\gamma$ indexes over the set of profiles in the
dataset, $N$ is the number of profiles in the
dataset, and $\Delta$ is the difference operator. If
$S_{ij}$ is large then on average over the set of all
TIGR profiles frequency channel $j$ has a large
effect on temperature (water) in pressure level
$i$, while if it is small then the network has
found little dependence of the temperature
(water) in pressure level $i$ on frequency
channel $j$.


In the plots of sensitivity analysis (Figs. 4
and 5), channels run from left to right, with
the lower wavenumbers to the left. Pressure 
levels run from front to back, with the surface 
at the back of the plot. The z axis represents 
sensitivity (the sum square of partials), aver- 
aged across all the training profiles. 
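A finite-difference sensitivity computation of this kind can be sketched as follows. The sketch assumes the sensitivity is the mean over profiles of the squared partial derivative of each output with respect to each input, which matches the description of the z axis above but may differ in normalization from what was actually plotted. The `net` argument is any function mapping an input vector to an output vector, and `toy_net` is a made-up stand-in for a trained network.

```python
def sensitivity(net, profiles, delta=0.01):
    """s[i][j]: mean over profiles of (dY_i / dX_j)^2 by forward differences."""
    n_in = len(profiles[0])
    n_out = len(net(profiles[0]))
    s = [[0.0] * n_in for _ in range(n_out)]
    for x in profiles:
        base = net(x)
        for j in range(n_in):
            bumped = list(x)
            bumped[j] += delta          # perturb one input channel
            out = net(bumped)
            for i in range(n_out):
                s[i][j] += ((out[i] - base[i]) / delta) ** 2
    return [[v / len(profiles) for v in row] for row in s]

def toy_net(x):
    """Stand-in for a trained network: output 0 uses channel 0, output 1 uses channel 1."""
    return [2.0 * x[0], x[1]]

profiles = [[0.3, 0.1, 0.5], [0.2, 0.4, 0.0]]
s = sensitivity(toy_net, profiles)
```

In this toy case channel 2 is never used, so its column of `s` is zero, the analogue of the ‘flat spot’ of unused channels seen in the sensitivity plots.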

For many channels, sensitivity peaks correspond
to weighting function peaks. The sensitivity
plot looks much more ‘noisy’, and this
is to be expected. (The sensitivity plot for an
untrained net looks much like uniform noise.)
In effect, the net has discovered its own
representation for the weighting functions, where
information from groups of channels is used to
retrieve information about a particular pressure
level. We conjecture that the ‘noisy looking’
sensitivity plot is inseparable from the
network’s good performance on noisy input.


7 Conclusions

We have demonstrated an application of back- 
propagation neural networks to the retrieval 
of accurate atmospheric temperature and wa- 
ter profiles, using the hundreds of channels of 
spectral information that will be available on 
the AIRS instrument. The prohibitive cost of 
training such large networks with large train- 
ing sets is ameliorated by an effective map- 
ping of the algorithm to the parallel architec- 
ture of the Maspar MP-1. The neural network 
allows us to make effective use of the large 
AIRS channel set, especially for better noise 
performance. Once the network is obtained it 
can be used to obtain very fast retrievals even 
with many input channels on modest compu- 
tational platforms. 

A sensitivity analysis of the network sug- 
gests ways we can refine the choice of chan- 
nels used by the network. In principle, one 
could take the entire AIRS channel set, train 
a net for (say) temperature retrievals, perform 
a sensitivity analysis on the resultant net, get 
a smaller set of temperature sensitive chan- 
nels, and use the smaller channel set to train 
a second net. 
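The channel refinement step could look like the sketch below. Ranking channels by their sensitivity summed over all output levels is one possible criterion and is an assumption here; the text does not fix a selection rule. The sensitivity values are made up.

```python
def select_channels(s, keep):
    """Rank channels by total sensitivity over all outputs; keep the top `keep`.

    s[i][j] is the sensitivity of output level i to input channel j.
    Returns the indices of the selected channels in ascending order.
    """
    totals = [sum(row[j] for row in s) for j in range(len(s[0]))]
    ranked = sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
    return sorted(ranked[:keep])

# Made-up sensitivities for 2 output levels and 4 channels.
s = [[4.0, 0.0, 1.0, 0.1],
     [0.5, 0.0, 2.0, 0.0]]
selected = select_channels(s, keep=2)
```

Here channels 0 and 2 carry almost all of the sensitivity, so they are the ones a second, smaller net would be trained on.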

There are a number of directions for further 
work. Our present results indicate it is likely 
that a somewhat larger net may have errors 
below 1K. It may be that simultaneously
retrieving temperature and water using a large
combined channel set will give even better
results than so far obtained. The retrieval of
other atmospheric parameters, such as O3, is a
promising area for further investigation, as
is the potential application of neural nets to
cloudy atmospheres.